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Abstract 

Compressed sensing (CS) is an important theory for sub-Nyquist sampling and recovery of 
compressible data. Recently, it has been extended by Pham and Venkatesh [31] to cope with the 
case where corruption to the CS data is modeled as impulsive noise. The new formulation, termed 
as robust CS, combines robust statistics and CS into a single framework to suppress outliers in the 
CS recovery. To solve the newly formulated robust CS problem, Pham and Venkatesh suggested 
a scheme that iteratively solves a number of CS problems, the solutions from which converge to 
the true robust compressed sensing solution. However, this scheme is rather inefficient as it has to 
use existing CS solvers as a proxy. To overcome limitation with the original robust CS algorithm, 
we propose to solve the robust CS problem directly in this paper and drive more computationally 
efficient algorithms by following latest advances in large-scale convex optimization for non-smooth 
regularization. Furthermore, we also extend the robust CS formulation to various settings, including 
additional affine constraints, £i-norm loss function, mixed-norm regularization, and multi-tasking, 
so as to further improve robust CS. We also derive simple but effective algorithms to solve these 
extensions. We demonstrate that the new algorithms provide much better computational advantage 
over the original robust CS formulation, and effectively solve more sophisticated extensions where 
the original methods simply cannot. We demonstrate the usefulness of the extensions on several 
CS imaging tasks. 

1 Introduction 



Compressed sensing (CS) [TJ, [13] is a powerful sub-Nyquist sampling theory for the acquisition and 
recovery of sparse signals, that has received special attention in signal and image processing as well as 
other related fields such as statistics and computer science. The CS theory states that if the unknown 
signal is inherently sparse, then it is possible to acquire and reconstruct signal (by solving a convex 
optimization problem) with a much lower number of measurements that would be otherwise needed 
under the existing Nyquist sampling scheme. In image processing, the CS theory is particularly rele- 
vant in several applications, such as magnetic resonant imaging (MRI) [28] or hyper-spectral imaging 
[10[,I17|. where acquisition time and/or sensing hardware cost play a significant role. Also, the sparsity 
assumption typically holds due to, for example, inherent wavelet structure in images [36] . 

In recent years, the CS literature has seen seen significant advances in both theory [31 W$\ [23] 
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and applications [TBI E2 E2J EH E3 EH ED] (many of which are collected in the CS repositor jo) . 
There are also a variety of specialized solvers for the CS recovery problem, which are developed 
from different angles, such as pursuit algorithms [12], [30], [32], optimization algorithms [18j|27], a 
complexity regularization algorithm [21] . and Bayesian methods |26j. 

In this work, we focus on a particular aspect of CS recovery, wherein the emphasis is on robustness. 
This is originally raised by Pham and Venkatesh [33] . They recognize that existing CS recovery schemes 
can be statistically inefficient when the corruption of CS measurements is modeled as impulsive noise. 
Such impulsive corruption can occur due to bit errors in transmission, malfunctioning pixels, faulty 
memory locations [9], and buffer overflow [TJJ], and has been raised in many image processing works 
[2| 111] 158]. To address this problem, Pham and Venkatesh [33] proposed a new formulation, known as 
robust CS, which combines traditional robust statistics [23] and existing CS into a single framework to 
effectively suppress outliers in the recovery. Whilst the focus of [33] is on the theoretical justification 
of the new formulation, they also suggested a provably convergent algorithm to solve their robust 
CS formulation. This majorization minimization (MM) algorithm finds the robust CS solution by 
iteratively solving a number of CS problems, the solutions from which converge to the true solution. 
However, this is not computationally efficient because each iteration involves a full CS recovery, which 
is always iterative in nature. 

To overcome the computational limitation of the original robust CS algorithm proposed in [34] . we 
propose two new algorithms that directly solve the robust CS formulation. They both have only one 
main loop and iteratively majorize the original robust CS objective function. One algorithm is adapted 
from the fast iterative shrinkage thresholding (FISTA) framework developed by Beck and Teboulle [3], 
which shares the same spirit as an unpublished work of Nesterov [33]. The other algorithm is based 
on a framework known as alternating direction method of multipliers (ADMM) [5]. Even though 
the original FISTA scheme was derived for the original CS problem, it can be used for robust CS. 
Our contribution is a theoretical result that allows one to compute the Lipchitz constant for the 
application of FISTA. Additionally, we also derive a generalized ADMM algorithm for solving the 
robust CS formulation efficiently, which differs from the FISTA algorithm in that operator splitting 
and approximation updates are used. This results in a method that has same update complexity as 
FISTA, but is more flexible to extend. 

Furthermore, we also extend robust CS in a number of directions, including additional affine con- 
straints, £i-norm loss function, mixed-norm regularization, and multi-tasking. We show that the 
ADMM is a powerful optimization framework for the robust CS problem as it can be modified or 
generalized to cope with these extensions, where often other CS techniques, including FISTA, find im- 
possible to do so. We show that the derived algorithms are simple to implement, provably convergent 
under the ADMM theory, and that they effectively solve complex robust CS formulations. 

The paper is organized follows. Section II gives some background on robust CS, whilst Section III 
describes the FISTA and ADMM algorithms for solving the robust CS formulation. Section IV presents 
four extensions of the robust CS formulation and derive computationally efficient algorithms for solving 
them. Section V contains numerical experiments to demonstrate the computational efficiency of the 
proposed algorithms. Finally, Section VI concludes the paper. 

All Matlab code to implement our methods described in this paper and reproduce our results is 



readily available at the following website http://www.computing.edu.au/~dsp/code.php 



^ttp: //dsp. rice . edu/ cs 
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2 Background 



In compressed sensing (CS), one is interested in the recovery of a sparse signal x £ M. N though the 
compressed measurement 

y = *x + n. (1) 

Here, 6 M> MxN is the CS matrix that represents the compressive sampling operation and n is 
additive noise. The CS matrix is required to some stable embedding conditions for stable recovery 
[BJ. As M < N in the CS setting, the recovery of x from y is generally ill-posed. The CS theory has 
established that under an assumption that x is sparse, it is possible to recover x reliably from y with 
an error upper bounded by the noise strength. Among various approaches to solve the CS recovery 
problem, the optimization formulation often provides the best achievability for a given CS matrix 

x = arg min j — ||y — 4>x||| + A||x||i 1 . (2) 



In the normal CS setting, the noise in ([T]) is often considered Gaussian with bounded norm 1 1 n 1 1 2 < £ 
and thus the maximum error induced by a CS recovery is C( 1 1 n || 2 ) - However, Pham and Venkatesh 
[34] have discovered that when the noise is indeed impulsive, such a result will still hold for normal 
CS recovery but is rather inefficient. Thus, they propose a modification to the CS formulation, known 
as robust CS, to appropriately address the characteristics of the underlying additive noise. This is 
achieved by considering the robust loss function instead of the quadratic cost function in ([2]) 

x = arg min {g(x) + A||x||i} . (3) 
Here, <?(x) = Y2i=i PiVi ~ (^ x )«) an d p{f) is the Huber's penalty function (soft limiter) given as follows 



\r\ < kv 2 
' ku 2 \r\ \r\ > ku 2 , 



P(r) = { 
and its derivative is given by 



r |r| < kv 2 

/cz/ 2 sgn(r) \r\ > kv 2 . 

The parameter k of the Huber's penalty function is determined by the fraction of the outliers whilst 
the scale parameter v is often estimated from some statistic of the median, such as the median of the 
absolute deviation (MAD). For detail, see [23] . 

As p(r) is quadratic or linear depending on the actual value of r, solving ([3]) directly is not trivial. 
Pham and Venkatesh suggested that instead of solving Q, a better alternative is to solve a series of 
the normal CS problems. The idea is to replace <?(x) with an approximate quadratic function at every 
outer iteration with the general form 

/ fc (x) = (l/2)(v fc -*x) T W(v fc -*x) + C (6) 

= (l/2)||W 1 / 2 v fe - W 1/2 $x||! + C, (7) 

where 

C = g(-k k ) - (i/2)V(y - *x fe ) T w-V(y - *x"), (8) 

v fc = w~ 1 -0(y - *x fc ) + *x fc . (9) 

Pham and Venkatesh detailed two options for W, which are commonly used in the robust statistics 
literature 



3 



• Modified residuals (MR): W = fjl 

• Iteratively reweighted: wu = ip(rf)/rf, Wij=o,i ^ j. 

When using Z fc (x) as shown in ([7J) for <?(x) in 0, the resultant problem is essentially a normal CS 
problem and thus considered solved. 

Whilst the above strategy will work, it is inefficient because each outer iteration involves a full CS 
problem and it is known that the CS problem needs to be solved iteratively as well. Therefore, the 
double loops are the main computational deficiency of the above strategy. To address this limitation, 
we consider bypassing the inner CS step and thus there will be only one loop for the overall algorithm. 
There are two powerful optimization frameworks that are suitable for this purpose, which we describe 
next. 



3 Proposed Algorithms 
3.1 FISTA Algorithm 

Fast iterative shrinkage thresholding (FISTA) is an optimization approach that effectively decouples 
the variables from the smooth loss function in the compressed sensing objective. This approach was 
proposed by [3], which also shares the same philosophy as an unpublished work of [33]. Technically, 
FISTA is a variant of majorization minimization (MM) algorithms [25] and has a special choice for 
the quadratic majorization as well updates that involve historical points. 

Consider minimizing a convex optimization of the form argmin x /(x) where 

/(x) = <?(x)+i?(x). (10) 

Here, g(x) is a smooth loss function, but the variables in this loss function are coupled. The core 
idea of FISTA is to consider a quadratic majorization of g(x), denotes as /i(x), such that it effectively 
decouples the variables. If such decoupling is possible, the approximate problem is then easier to solve 
even when the regularization term i?(x) is possibly non-smooth (such as ||x||i), because it can be 
decomposed into a number of univariate optimization problems whose solution is analytical. 

The first trick of FISTA is to decouple the variables by considering the majorization at iteration k 
and approximation point z k 

h(x;z k ) = g(z k ) + Vg(z k ) T (x - z k ) + |||x - z k f 2 . (11) 

Here, z k is used as the approximation point rather than x fc as it involves historical updates of x fc by 
a careful choice, which is subsequently show in (|17j) . Also, L is the Lipchitz constant of the gradient 
of the loss function <?(x) to ensure that h(x) is a proper majorization of g(x). Thus, at iteration k, 
FISTA finds x fe via 

x fc = argmin |^||x - v fc ||2 + i?(x)| (12) 

where v fc = z k — (l/L)Vg(z k ). For the quadratic loss function g(x) = ^||y — ^x|||, it can be shown 
that L = 2A max (3? T <I?), and v = z k — (l/L)($ T $z fc — y). For the £i-norm regularization as in the 
case of CS, this results in 

x fe = argmin | — ||x — v||2 + A||x||i 1 . (13) 
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This problem can be solved element-wise and its solution is 

x fc = S A/L (v), (14) 
where the soft-thresholding shrinkage operator is defined as 

S r (x) = {t:ti = sign(xi) max(\xi \ -r,0)}. (15) 
The second trick of FISTA is to use a clever update of the approximation point to speed up convergence 



= hvWF (16) 

= x'+f^^-W-x*- 1 ). (17) 

The original FISTA framework can be readily used for robust CS case if v fc and the Lipchitz 
constant can be computed for the robust loss function. In case of v fc , it can be easily seen that 

v k = z k -y& T ij>(&z k -y). (18) 

It remains to compute the Lipchitz constant. To do so, we rely on the following result: 

Lemma 1 Let f{x) be a smooth convex function on X and suppose that the domain X is divided into 
two regions X\ and X^, such that f(x) = g(x) if x G X\ and f(x) = h(x) if x € X2, and X\ U X2 = X , 
and that g(x) = h(x) for x £ X\ n X<i- Denote as L g and Lh the Lipchitz constants of g and h 
respectively on the domains X\ and X2 ■ Then the Lipchitz constant of f is bounded by 

L f < {L g + L h }. 

The proof of this Lemma is detailed in the Appendix. The result implies that for mixed functions 
like the robust CS loss functions being considered, we just take the sum of Lipchitz constants over 
each continuous and bounded domain. The Lipchitz constant for the quadratic part is as before, i.e. 
L 1 = 2A mgjf^^), whilst for the linear part we can split into negative and positive domain. In both 
cases, the Lipchitz constant is zero due to the fact that tp is a constant. Thus, the Lipchitz constant 
for the robust CS cost function is still 2A max (<&- r <I>). 

3.2 ADMM Algorithm 

Alternating direction method of multipliers (ADMM) is a simple but powerful framework in optimiza- 
tion, which is suited for today's large-scale problems arising in machine learning and signal processing. 
The method was in fact developed a long ago before advanced computing power was available, and 
re-discovered many times under different perspectives. Recently, [5] has unified the framework in a 
simple and concise explanation. In either the CS or robust CS problem, the main technical challenge 
is that the variables are coupled through in either the quadratic or robust loss function. This makes 
it rather difficult when the extra constraint with non-smooth t\ norm is introduced. In principle, 
the problem is easier to tackle if the variables can be decoupled, so that the problem can be solved 
element-wise or group- wise. Using a clever trick, known as operator splitting [15], the ADMM frame- 
work suggests to separate the regularization term from the smooth term by introducing an additional 
variable z, which is tied to the original variable via an affine constraint: 

min XjZ g(x) + ||z||i s.t x — z = 0. (19) 
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Here, <?(x) is the robust CS loss function. For this type of regularized objective function, ADMM 
considers the following augmented Lagrangian 

£(x,z,y) = 5 (x)+A||z|| 1 +w T (x-z) + |||x-z||2. (20) 

Here, r\ is the parameter associated with the augmentation 3||x — z|||, and this is to improve the 
numerical stability of the algorithm. The strategy for minimizing this augmented Lagrangian is 
iterative updating of the primal and dual variables. With a further normalization on the dual variable 
u = (l/ry)w, it is shown [5] that as far as the primal and dual variables x and z are concerned 

£(x, z; u) = g(x) + A||z||i + ^||x — z + u||| + const. (21) 

where the constant is independent of x and z (actually const = — r/||u|||/2). Note of the semi-colon, 
which treats u as a parameter rather than a variable when solving for other variables. Thus, the 
optimality point of the Lagrangian can be found by iteratively updating the variables as follows: 

argmin{ 9 (x) + |||x-z fc + u fc ||^} (22) 

argmin jA||z||i + | ||x fc+1 - z + u fc 1 1 1 1 (23) 
u fc + x fc+1 -z k+1 . (24) 

We note that the update steps for u and z are straightforward. In particular, for z it is known that 
it is a soft-thresholding shrinkage operation 

z k+1 = S A/r ,(x fc+1 +u fc ). (25) 

Due to the nature of <?(x), there is no exact solution for (|22|) . and finding it always necessitates 
iterative algorithms. This will increase computational burden to the overall algorithm in a similar way 
as the previous robust CS algorithms introduced in [34] . To alleviate the computational problem, we 
propose to follow a novel framework, known as generalized ADMM and developed by Eckstein and 
Bertsekas [15]. In generalized ADMM, the update steps can be solved approximately as long as the 
differences between the exact and approximate solutions generate a summable sequence. When such 
a condition is satisfied, the generalized ADMM theory has proved that the algorithm will converge to 
the solution [151 Theorem 8]. 

To utilize the generalized ADMM theory, once again we adapt an MM algorithm to solve (|22p . 
which is in the same spirit as the original robust CS [33]. In essence, this replaces <?(x) with a suitable 
quadratic majorization as discussed previously. The major difference is that we only perform the 
minimization of the majorization once, as opposed to iteratively as in [53]. Specifically, we propose to 
modify the update step for x in (|22p by using the quadratic approximation of g(x) at iteration k as 
l k (x) (shown in ©) 

x fc+1 = argminri^llW^v^-WV^xlll + ^llx-z^ + u* 1 !!!. (26) 
x 2 

It can be easily recognized that the solution of this problem is exact 

x fc+l = ($T W qT + r?I )-l($r Wv fc + ^ z k _ u fc))_ ( 27 ) 

We note that the quadratic approximation of FISTA can also be used. However, the choice above leads 
to a better approximation and hence will converge to the true solution faster. It is also easily seen 



x fc+1 = 

z fe+1 = 
u k+1 = 
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that for the MR choice of the quadratic approximation where W = /xl, the matrix under inversion in 
(127]) is fixed 

x fc+1 = (n$ T $ T + r^V^V 2 + rj(z k - u fc )). (28) 

Hence, the inversion + ^I)^ 1 can be computed once and cached so that the update step in 

subsequent iterations can be fast. 

The generalized ADMM for the specific case being considered can be stated as follows: 

Theorem 1 Consider an ADMM algorithm that solves the convex problem (0j via the updates H28\) . 

Denote as x^ +1 the exact solution of HUty, and as x fc+1 the approximate of l^ffl) via 
If the sequence {fJ-k+i '■ Hk+i = ||x fc+1 — x£ +1 ||2} is summable, i.e., X)a^=i < °°j then the above 
updates will generate a sequence {x fe+1 } that converge to the true solution of 

Next, we discuss the convergence stopping condition of the proposed generalized ADMM algorithm. 
When the update steps are solved exactly, the existing ADMM theory [5] states that the penalty 
parameter rj affects both the primal residual (defined as s fc+1 = rj(z k+1 — z k )), and the primal residual 
(defined as r fc+1 = x fc+1 — z fc+1 ) in an opposite manner: a large r\ tends to generate a small primal 
residual and a large dual residual and vice versa. Thus, selecting the optimal penalty parameter is 
typically a trade-off between primal and residual residuals with an ADMM algorithm, and r] = 1 
generally works for most cases. However, more emphasis should be made to the primal residual in the 
case of the proposed generalized ADMM algorithm because the update step of the primal variable x 
is not solved exactly. This will ensure that the approximation error in the primal variable is promptly 
compensated by the dual update, at the small sacrifice in convergence rate due to the residual error 
being slightly larger. Intensive numerical studies suggest that a value for r/ of between 2 and 5 for r] 
works rather well in many cases. We shall examine this in more detail in the experimental section, 
where we use r\ = 2. For stopping condition, we terminate the algorithm when the primal and dual 
variables are sufficiently small. For standard settings of absolute and relative tolerances please see [5]. 

4 Beyond Robust CS 

The FISTA and ADMM algorithms for robust CS presented tackle the optimization from slightly 
different angles. Whilst FISTA solves the problem by replacing the robust cost function with a simpler 
quadratic approximation that decouples the variables, the ADMM decouples the l\ regularization norm 
via operator splitting. Whilst FISTA has only one approximation, ADMM involves operator splitting 
and quadratic approximation at the step that updates x. Thus it appears that FISTA may have a 
convergence advantage due to being simpler and having less tuning requirements. However, numerical 
experience indicates that for a given tolerance, the ADMM algorithm is actually faster than FISTA 
in terms of both number of iterations or computational time to reach a given tolerance. This will be 
illustrated further in the experimental section. 

The advantage of ADMM is better realized when one needs to extend robust CS in similar ways as 
many extensions on the basic CS have been made in the literature. This is difficult, if not impossible, 
with the FISTA scheme. Next, we discuss several possible extensions that can be simply achieved 
with the proposed ADMM algorithm. 
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4.1 Additional Affine Constraints 

In some cases, one would like to impose additional affine constraints on the optimization problem 
c r x = 1. This could be of prior knowledge on the power modeling (i.e., when Xi is known a priori) 
and this could potentially improve stabilization of the CS solution. Thus, the Lagrangian (|20p could 
be altered as follows 

£(x,z,yi,y 2 ) = #(x) + A||z||i + Wi(x-z) + y||x-z||! 

+w 2 (c T x-l) + ^||c T x-l||2. (29) 

Here, wi and W2 are the dual variables for the equality constraints. Again, by scaling the dual variables 
ui = W1/771 and 112 = W2/r]2 we obtain 

£(x,z; Ui,u 2 ) = g{x) + A||z||i + y||x ~~ z + Ul H2 + ^ll cTx ~~ 1 + ""2 111 + const. (30) 

Thus, the ADMM update step for x is the solution of the problem 

x fc+1 = argm x in{ 5 (x) + |||x-z fc + u^||2 + |||c r x-l+^||l}. (31) 

Once again, if this step is to be solved approximately using a quadratic majorization with W = fjj as 
discussed previously then it can be shown that 

x fe+1 = H(/^ T v fc + m (z k - u k ) + 173(1 - u *)c), (32) 

where H = (/i3> T W<I> + rji I + 77 2 cc T ) _1 . 

It can be shown that the updates step for z remains the same as (|25|) except that u and r\ are 
replaced with ui and r}\ respectively. Finally, the updates of the dual variables are 

u^ +1 = u k + x k+1 -z k+1 , (33) 
u k+1 = «* + c T x fc+1 - 1. (34) 

Just like the basic ADMM algorithm, convergence is determined when both the primal and dual 
residuals are sufficiently small. Whilst the dual residual is as before, i.e., s = r]i(z k+1 — z k ), there are 
effectively two residual vectors r k = x. k — z k and r\ = c T x k — 1. Depending on the desired accuracy 
requirement of a particular application, the stopping criterion can be determined accordingly (see 
p.19]). 

4.2 Mixed-Norm Regularization 

In certain situations, one may wish to impose £2 regularization on the solution of the recovery. Such 
a motivation may arise from the fact that the absolute sparse model may not be realistic, and thus it 
is more desirable to consider 

x = arg min |g(x) + A||x||i + ^||x|||} . (35) 

Even in the sparse case, an additional quadratic regularization with a small /3 could improve numerical 
stability against rank deficiency of 3>. In the case of quadratic loss function, i.e., <?(x) = ^||y — <&x||2, 
this is known as the elastic- net [H]. Thus, the proposed formulation could be interpreted as a robust 
version of the elastic-net. The robust CS formulation is treated a special case when f3 = 0. 
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For the original elastic-net, it is easily recognized that a simple algebra can convert it to a Lasso (or 
CS) form, and thus it can be solved with many efficient ^i-regularization algorithms. For the proposed 
robust elastic-net, it is not possible because of the loss function <?(x) being not quadratic. However, it 
is trivial to show that it is possible to modify the FISTA and generalized ADMM algorithms discussed 
in the previous section to cater for this additional regularization term. Indeed, this regularization 
term only affects the update step of x. In both FISTA and generalized ADMM, the majorization is a 
quadratic function and thus absorbing this extra quadratic term is straightforward. For example, in 
the case of the FISTA algorithm, we need to solve (c.f. (|12p 



min (l/2)||v-x||2 + (A/L)||x|| 1 + (/3/L)||x||2, (36) 

X 

which is equivalent to 

1 



mm — 

x 2 



x 



[ + ih Mi - (37) 



1 + ifi/L) 

which is of the same form and this induces the soft-thresholding shrinkage operation. Likewise, in the 
case of the generalized ADMM algorithm, we need to to solve (c.f 



x fc+1 = argmin(l/2)||W 1 / 2 v fc -W 1 / 2 *x||^ + ^||x-z fe + u fc ||| + |[/3|||, (38) 
x 2 



and thus this has only a slight modification compared with ([2] 

x fc+1 = (^* T * T + (7/ + ^I)" 1 ^* 2 ^* + v(z k - u fc )). (39) 
Thus, extension to mixed-norm regularization is straightforward of the proposed ADMM algorithm. 



4.3 t\ Loss Function 

In the original robust CS paper [33], the Huber loss is selected. This is suitable for impulsive noise 
being modeled as a contaminated mixture [23]. However, the robust CS framework is not necessarily 
restricted to the Huber loss function and indeed many loss functions in the robust statistics can be used 
to cater for different noise types. One particular interest is the £i-norm loss function, which is optimal 
when the impulsive noise is modeled as a Cauchy distribution [24] . In this case, <?(x) = ||y — «frx||i 
and thus it is desirable to solve 

x = arg min {lly — *&x||i + Allxlli} . (40) 

We note that the FISTA algorithm is not easily derived, because the loss function is not differentiable. 

To overcome the difficulty associated with two parts of the objective function that are both non- 
differentiable, we propose to apply the operator splitting mechanism of the ADMM framework twice. 
Specifically, we introduce two additional variables v and z and rewrite the formulation as 

arg min ||v||i + A||z||i 

x,v,z 

s.t. <&x — v — y = 

x - z = 0. (41) 

Thus, the augmented Lagrangian is 

£(x, v, z, wi,w 2 ) = ||v||i + A||z||i + wf (*x - v - y) + y ||*x - v - y||| 

+wi , (x-z) + ^||x-z||l. (42) 
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With the scaled dual variables ui = wi/r/i and 112 = W2/772, we can rewrite 

£(x, v, z;ui,u 2 ) = ||v||i + A||z||i + ^||*x - v - y + U1II2 + yII x ~ z + u 2lll + const - (43) 

With this form, the updates for the variables are easily computed under the ADMM principle. For x, 
the update solves the problem 

x fc+1 = argmin^-||*x- v fc - y + Ui'||| + y ||x - z fc + u^||| (44) 

which yields the exact solution 

x fc+1 = ( Vl ^ T ^ + m iy\ m ^ T ^ k +y-u k 1 ) + r l2 (z k -u k 2 )). (45) 

For both v and z, it is easily recognized that the update steps are simple soft-thresholding operations. 
For v, the update step solves 

v fc+1 = argmin ||v||i + — ||t fe - v|||, (46) 
v 2 

where t k = $x fc+1 — y + u k . Likewise, for z the update step solves 

z k+1 = argminA||z||i + ^||x fc+1 +u^-z|||. (47) 
z 2 

They both have a similar form as (|23p . and thus from (|25[) we deduce (c.f. ()15p ) 

v fc+1 = S 1/T)1 (t fc ), (48) 
z k+1 = S A/)B (x* +1 + u£), (49) 

as the updates for v and z. Finally, the dual updates are 

u^ +1 = u k + $x fe+1 - v fc+1 - y (50) 
u k+1 = u£ + x fc+1 -z* +1 . (51) 

The stopping criterion is when the residual vectors are sufficiently small, including = r/i(v fc+1 — v fe ), 
= rj 2 (z k+1 - z k ), r k = x fc - z k , and r k = $x fc - y - v fc . 

4.4 Multi-Task Setting 

The recent literature on CS also reveals that the basic sparsity recovery scheme can be improved if 
one exploits further domain knowledge. Such an exploitation could be based on the constraint of the 
sparsity models. Extensions, such as model-based CS [3] and group sparsity [23], are key examples 
of the exploitation that can effectively reduce the CS requirements for a comparable recovery error 
when compared with conventional CS. Here, we focus on a slight variation where there are multiple CS 
tasks to be performed: there are multiple CS measurements yi,i = 1,...,L, each follows the model 
yi = *Xj + rij. 

In the image processing context, this could arise in, for example, compressed sensing of multiple 
video images. In these circumstances, there many be similarities between images. For example, moving 
images likely consist of relatively same large background and small moving objects. Thus, the sparse 
representation of these original images may have similar sparse coefficients representing the common 
background part (see Fig. [T] for an illustration of a sequence of random bars images used later in the 
experiment). For that reason, it follows from the existing results on advanced CS |23| that exploiting 
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Haar wavelet coefficients of a sequence of random bars images 




the shared structure between tasks is likely to improve CS recovery compared to the case where the 
tasks are performed independently. 

Denote as X = [xi, . . . , x^] the collection of sparse vectors to be recovered from the tasks, and 
Y = [yi, . . . , vl] the collection of CS measurements. Extending the single-task robust CS, the multi- 
task robust CS can be formulated as follows 

* = ar S x ^, i^ X ) + A II XT UW • (52) 



Here, g(X.) = Yli=i P(y« ~~ * x «) an d ||A||^ 2 /^ 1 = ^ 1 1 a.^ 1 1 2 where a^'s denote the columns of A. Clearly 
the loss term is the same, whilst for the regularization terms, we seek sparsity along the columns of X 
but denseness along the rows of X. This clearly reflects the prior assumption that sparse coefficients 
of the common parts are likely to be similar, hence the corresponding rows of X should be dense, 
whilst it is sparse column-wise to respect the single-task CS's assumption. When g(X) is a quadratic 
loss function, this is a special matrix formulation of group Lasso in the statistics literature [T| 1291 [39] . 

We now show that it is possible to extend both the FISTA and generalized ADMM algorithms 
to cater for this formulation. Before doing so, we present a generalization of the soft-thresholding 
shrinkage operation as follows: 

Lemma 2 The optimization problem 

argmin |a||z|| 2 + ^||v - z|||| . (53) 

has the solution z = v x max(||v||2 — A/77, 0) / 1 1 v 1 1 2 



This result can be proved by simple geometrical arguments. Indeed, denote z* as the solution of ()53|) . 
then we consider all points z such that ||v — z 1 1 2 = ||v — z* || 2 = R- It turns out that these points are 
lying on the ball with center at v and radius R. Among these points, only the point that satisfies 
z = av, i.e., intersection of the ball and the vector v, will have minimum £2 norm, which minimizes 
the second term in ([53|) . Substituting this into ([53]) yields the form of the soft-thresholding shrinkage 
problem, for which the result is obtained after simple manipulations. 

4.4.1 FISTA algorithm. 

Generalizing (fl~2|) for the multi-task settings, denote as V fc = [vf, . . . , v£], where = z^ — ^^^(^z^- 
yi). Then, the update step for X solves 

argmin -|| V - X||| + A||X T || £2/£l . (54) 

This problem can be written row-wise in the form of (|53p and thus the solution is exact. Meanwhile, 
the update step for Z is also similar 

Z k+i = x fe + ^ll(X fe -X fe - 1 ), (55) 
tk+i 



with t k+1 = (1 + ^/l + 4(t fc ) 2 )/2. 
4.4.2 ADMM algorithm. 

We rewrite the Lagrangian for the current setting as follows 

£(X,Z;U) = 5 (X) + A||Z T || V£l + ^ ||X - Z + U||| + const. (56) 
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Thus, the ADMM update steps are 

X fc+i = ar g min{ 5 (X) + ^||X-Z fe + U fc ||U (57) 

Z k+1 = a rgmin{A||Z T ||, 2/ , 1 + |||X fe+1 +U fc -Z|||} (58) 
n fc+i = ufc + x fc+1 - Z k+1 . (59) 

Like FISTA, the update step of Z can easily be decomposed row-wise, each has the form of (f53j) . and 

thus the solution for each row of Z k+1 can be obtained immediately. For the update step of X, again 

we resort to the generalized ADMM principle. That is, we approximate <?(X) with a quadratic loss at 
X fe 

h(K;X k ) = 5 (X fc ) + Vx5(X fe f(X-X fc ) + |||X-X fc ||^ (60) 



Thus, the generalized ADMM algorithm finds the update via 

X fc+1 = argmin{v X 5(X fc ) T (X-X fc ) + |||X-X fc ||^ + |||X-Z fc + U fc ||^}, (61) 

which yields the following solution 

X fc+1 = (/i* T $ + rjl)- 1 (ji* T V k + r](Z k - U fe )), (62) 

where V k = [v k , . . . , v|], v fc = ±V(yi - *x 4 fc ) + *x 4 fe . 

The stopping criterion is when all primal and dual residual matrices are small, they include 

S fc+1 = r](Z k+1 -Z k ) (63) 
K k+i = X fc+1 -Z fc+1 . (64) 

Like the single-task case, one should set r\ sufficiently large to obtain a smooth decrease of the objective 
function. 

4.5 Discussion 

Further extensions. We have presented some fundamental extensions of the CS formulation. Under 
the ADMM frameworks, it appears that it is possible to consider extensions based on the combination 
of the basics extensions presented. For example, the t\ loss could be used with affine constraint or in 
multi-task setting, etc. Such extensions will be worthwhile investigation for future work. 

Regularization Path. In practice, the optimal value of the regularization A is not known in advance, 
and thus one needs to select a proper value to do robust CS recovery. Such a problem is known in 
statistics as model selection. Typically, one needs to compute the recovery along the regularization 
path, and select the one which meets the i\ norm constraint. This is discussed in detail in |34j . 
Essentially, some estimates of the noise statistics must be obtained in order to construct the bound on 
the residual e. It is well-known that there exist a A max = ||3? T y||oo above which the solution is zero. 
For decreasing values of A, the residual r = y — <l>x will become smaller whilst the recovery becomes 
denser. The optimal A is the maximum value of A such that the bound constraint on the residual 
vector is met. In CS recovery, this happens when 1 1 1 1 2 — £ -> whilst in robust CS recovery, Pham and 
Venkatesh [M] have suggested p(r) < e, which is a generalization of the CS selection criteria for the 
robust case. In our implementation, we combine a coarse grid search and a fine bi-section search to 
find this optimal A (see Fig. [2] for an illustration). 
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Regularization path for robust CS - Random bars example 
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Figure 2: Regularization path of robust CS for random bars example 
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Figure 3: Convergence property of the compared robust CS algorithms 

Cholesky Factorization. As can be seen, most update step of x in different ADMM variants involves 
the computation of the form x fc+1 = (fi<& T & + Q) _1 q where Q is an positive definite matrix. The 
matrix under inversion has a size of iV x N and it is large in image processing application. Thus, it 
is inefficient to compute the inversion directly to obtain the update. A much more efficient approach 
is to use Cholesky decomposition to achieve the goal. It is known from linear algebra that if H is a 
positive definite matrix then it admits the factorization H = LL T and thus H _1 q can be efficiently 
computed by solving Lxi = q first, then L T x = xi, which can be written as x = L r \ (L \ q). 
For compressed sensing applications where $ is a fat matrix, further exploitation can be made by 
reducing the dimension of the matrix for Cholesky factorization. Indeed, according to the matrix 
inversion lemma 

(/x* T * + Q) _1 = Q _1 - Q _1 * T P _1 (p*Q _1 ), 

where P = I + /i<&Q _1 <I> T . Suppose that the Cholesky factorization of P is P = LL T then 

(^ T & + Q) _1 q = Q-^q - * r (L r \ (L \ (/xSQ^q)))). 

We can avoid the direct inversion of Q by exploiting the fact that if Q = p±I + p2^c T then the matrix 
inversion lemma once again gives 

Q- 1 = p^I - 7 cc T , 

where 7 = pj~ 2 (p2 1 + Pi lcTc )- Finally, we note that this Cholesky factorization is independent of 
the regularization parameter A and thus it can be cached for the whole regularization path to reduce 
computation. 

5 Experiments 

5.1 Comparison of Numerical Properties 

We examine the convergence property of the FISTA and ADMM algorithms and compare them with 
the previously proposed method in [M], which we refer to as nested robust CS algorithm due to the 
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Figure 4: Image recovery - random bars example 
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nature of the double loops inside that algorithm. As the nested robust CS algorithm [33] is dependent 
on the particular CS solver being used for the inner loop, we select the ADMM implementation as 
the CS solver because it provides the best computational accuracy and speed. Note that Pham and 
Venkatesh [S3] used the ll_ls algorithm originally, which is known for high-accuracy but computa- 
tionally expensive. However, numerical experience shows that the inner steps do not required to be 
solved with high accuracy. Thus, the ADMM implementation as a CS solver for the nested robust 
CS algorithm is better overall. In this case, it can be seen that the computational complexity per 
iteration (regardless of inner or outer) in all compared algorithms are approximately the same: they 
all involve the computation of the majorization point and the soft-thresholding shrinkage operation. 

To compare the algorithms, we examine two aspects: the error versus the iterations and the com- 
putational time taken to achieve a particular tolerance. Whilst the former indicates how fast an 
algorithm converges, the latter provides a much valuable insight for practical purpose. To do so, we 
let all algorithms run for sufficiently large number of iterations and measure the error (with respect 
to the true value of the robust CS solution) as iterations go on, and the computational time taken 
when the error reaches certain thresholds. For the ADMM-based CS solver used in the inner loop of 
the nested robust CS algorithm, we select the termination with relative tolerance of 10~ 2 and abso- 
lute tolerance of 10 -4 (see [SJ p. 19]). This allows a reasonable convergence within the inner loops. 
We also choose the modified residual approach for nested robust CS as it is simpler without loosing 
convergence advantage. All algorithms are implemented in Matlab, and roughly optimized. 

We revisit the random bars example in [33] (see also Fig. |3|) and the results of this study is shown 
in Fig. In this example, the signal to noise ratio is 20dB and the impulsive noise is modeled as 
a two-component Gaussian mixture model where the there is 10% contamination whose variance is 
K = 100 times that of the main component. Here, the left subplot shows the reduction of the error 
versus the iterations, whilst the right plot shows the time taken to achieve the relative accuracy from 
initialized zeros (as indicated by 1E0) to as small as 10 -10 of the initial error (as indicated by 1E-10). 
We note the error profile of the nested robust CS algorithm ranges considerably due to the fact that 
we measure with respect to the global solution of the outer loop and that within each CS inner loop 
the algorithm still converges normally. 

Clearly the error profile plot indicates that the ADMM algorithm offers the best convergence speed 
per iteration, followed by the FISTA algorithm. For example, to achieve an accuracy of 10 -5 of the 
initial error, it only takes the ADMM algorithm less than 100 iterations, whilst the FISTA algorithm 
needs to spend more than 20 times, and the nested algorithm would need 200 times the number of 
iterations. In terms of the actual time taken to achieve a particular tolerance, the right subplot further 
indicates the advantage of ADMM and FISTA algorithms over the nested one. In practice, one would 
be interested in the tolerance of between 10 -2 to 10 -6 , over which the ADMM and FISTA algorithms 
are observed to be 100 and 10 times faster than the nested algorithm respectively. 

In Fig. [31 we shows the actual image recovery of all compared methods, including the CS, the 
nested robust CS, the FISTA robust CS, and the ADMM robust CS algorithms on this random bars 
example. The original random bars image is shown on the top left subplot, whilst its Haar wavelet 
coefficients are shown on the top right subplot. The results clearly show that all robust CS methods 
achieve an PSNR of about 26dB, which is 1.5dB better that that of conventional CS recovery. We 
note that there is a very minor different between robust CS algorithms, due to different convergence 
termination conditions, which is unavoidable. 
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Figure 5: Convergence behavior of affine robust CS 
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Figure 6: Image recovery of affine robust CS formulation 



5.2 Recovery with Affine Robust CS 

Next, we examine how much improvement can be made to robust CS if the power is known. The affine 
robust CS formulation is slightly different to the robust CS formulation in that additional constraint 
c T x = 1 is imposed, and here we select c = l/(l T x) = x i an d assume that ^ Xi is known. 

First, we examine the convergence behavior of the affine ADMM robust CS algorithm to solve 
this formulation by revisiting the random bars example. In this case, we select pi = pi = 1 and 
let the algorithm run over sufficient number of iterations. The results are shown in Fig. [5j Again, 
the left subplot shows the absolute error against the iterations whilst the right subplots indicates 
computational time taken to reach a particular accuracy. Compared with those of the ADMM robust 
CS algorithm, it can be clearly seen that the affine ADMM robust CS algorithm takes more time to 
reach. This is as expected because there are only minor changes to the update steps of the primal and 
dual variables. 

Next, we examine the actual image recovery of affine robust CS formulation. Once again, the 
random bars example is used and the recovered images are shown in Fig. [6J Here, we compare with 
the robust CS formulation via the ADMM algorithm. The result indicates that there is a slight gain 
in the recovery, though it is rather little. As a result, the recovered images look similar. 
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Figure 7: Convergence of the ^i-loss ADMM robust CS algorithm 
5.3 Recovery with l\ Robust Loss Function 

Next, we demonstrate the robust CS algorithm with l\ loss function rather than the Huber's loss 
function used in [34] , This is useful in situations with very impulsive corruption, where the noise is 
best modeled by a Cauchy distribution. To do so, we revisit the random bars example, but we use 
Cauchy noise instead. For the t\ ADMM robust CS algorithm, the model selection criteria is the t\ 
norm of the residual, rather than the Huber's loss function to reflect the new formulation. Other than 
that, all other experimental settings remain the same. 

First, we examine the convergence behavior of the ADMM robust CS algorithm with i\ loss. Fig. 
[7] shows the typical convergence behavior of the algorithm in terms of accuracy versus iterations (left) 
and computational time taken to reach certain accuracy (right). It is observed that the convergence 
is slower with modest accuracy as compared with the formulation using Huber's loss function. This 
is as expected from ADMM optimization theory due to an increasing number of variables to solve the 
i\ loss formulation. Nevertheless, modest accuracy might be sufficient for many practical situations. 

Next, we examine image recovery quality in Cauchy noise. Fig. [5] shows the image recovery for CS, 
robust CS using nested, ADMM, and ^-regularized ADMM algorithms respectively. Due to Cauchy 
noise, it is of interest to note that the CS completely fails with no meaningful pattern recovered. 
The other nested and ADMM algorithm still maintain reasonably recovery quality with an PSNR 
of around 21dB. The t\ -regularized ADMM algorithm achieves the best result with an PSNR of 
25dB, a significant improvement compared with the other two robust CS algorithms. It is also noted 
that the computational time of the ^-regularized ADMM algorithm is almost equal to that of the 
ADMM robust CS algorithm due to the fact that the update steps of the two algorithms have similar 
complexity. 

Though the l\ loss is primarily used for noise modeling as the Cauchy distribution, it is still of 
interest to examine how it behaves if the noise is modeled as from the Gaussian mixture as used 
previously. We again revisit the settings in the previous experiment and the result is shown in Fig. [9l 
Surprisingly, the £i-loss formulation provides a considerable PSNR gain of 4dB over the Huber's loss 
robust CS formulation. 

Thus, despite having less favorable convergence properties, the robust CS formulation with t\ loss 
still appears a better performer for practical image recovery. 



19 



CS Recovery, PSNR=13.1401dB 
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Figure 8: Image recovery in Cauchy noise 
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Figure 9: Image recovery of robust CS with i\ loss function in Gaussian mixture noise 
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Figure 10: Recovery of sequence of compressed images 
5.4 Recovery of A Sequence of Compressed Images 

Finally, we demonstrate the usefulness of the multi-task robust CS formulation when a sequence of 
10 compressed images corrupted by impulsive noise need to be recovered. Whilst each image in the 
sequence can be recovered separately, the multi-task robust CS formulation suggests that exploiting the 
shared structure between the tasks may provide better recovery. To do so, we consider a sequence of 
random bars frames shown in the top row of Fig. [TO) Here, there are common static random bars and a 
moving block across the frames. Obviously, the wavelet coefficients for common static bars are shared 
between the CS tasks. Only the coefficients corresponding to the moving block distinguish between 
tasks. This is clearly illustrated in Fig. Q] which shows an image plot of Haar wavelet coefficients of 
all 10 random bars image in a sequence: the horizontal lines correspond to common coefficients. 

The settings for the recovery are the same as previous experiments. For robust CS, we select the 
ADMM algorithm, and similarly for multi-task robust CS we also select the corresponding multi-task 
ADMM algorithm. The first 4 recovered images are shown in Fig. [TUJ the second row shows CS 
recovery, the third row shows robust CS recovery, and finally the last row shows multi-task robust CS 
recovery. The actual PSNRs for every frame are shown on Fig. [TTJ Here, we observe clearly that, on 
average, the multi-task robust CS formulation does provide a significant improvement over the robust 
CS formulation, both of which outperform CS recovery considerably. 



21 




Figure 11: PSNR Comparison of CS, robust CS, and multi-task (MT) robust CS 
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6 Conclusion 



We have presented more computationally efficient and extendable approaches to the recently proposed 
robust CS algorithm. We have also extended robust CS formulation in a number of ways, including 
affine constraints, ^i-loss function, and multi-task formulation. For improving computational efficiency 
of robust CS, we found that the (generalized) ADMM robust CS algorithm is the best, then followed 
by the FISTA robust CS algorithm. We also found that imposing affine constraint can provide im- 
provement, though slightly. The striking result is that i\ loss formulation for robust CS seems to 
offer considerable gain over the Huber's loss formulation, despite the fact that its convergence seems 
slower. Finally, in the case where one needs to robustly recover a sequence of compressed images, the 
multi-task formulation is proved to provide additional advantages in terms of both PSNR output and 
computational speed. 



Appendices 
Proof of Lemma Q] 

We start from the definition of the Lipchitz constant as a term such as 

sup |/(zi) - f{x 2 )\ < Lf\x x - x 2 \. (65) 

As there are two possible scenarios x\,x 2 G X\, x±,x 2 G X 2 , and x± G X\,x 2 G X 2 and from the 
definition of L g and Lh, we immediately have 

L f < max{L 9 , L h , L 12 }, (66) 

where L\ 2 is defined as the minimum constant such that 

sup Ipfci) - h(x 2 )\ < Li 2 \xt - x 2 \. (67) 

Let X 3 = X\ D X 2 . For arbitrary x\ G X\ and x 2 G X 2 we construct X3 G X 3 such that it is a convex 
combination of x\ and x 2 , so that \x 2 — x$\ < \x\ — x 2 \ and \x\ — x^\ < \x± — x 2 \. Then using triangle 
inequalities and definitions of L g and L^, we have 

swp\g(xi) -h(x 2 )\ = swp\g(xi) - g{x 3 ) + g{x 3 ) + h(x 2 )\ 

= sup\g(xi) - g(x 3 ) + h(x 3 ) - h(x 2 )\ 

< sup\g(xi) - g{x 3 )\ + \h(x 3 ) - h(x 2 )\ 

< sup \g{xi) - g(x 3 )\ + sup \h(x 2 ) - h(x 3 )\ 

< L g \xi - x 3 \ + L h \x 2 - x 3 \ 

< L g \xi — x 2 \ + Lh\x 2 — x\\ 

< (L g + L h )\ Xl - x 2 \. (68) 
The proof immediate follows from (|66p and ()68p . 
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